# Multimodal Instruction Fine-tuning

## Llama 3.2 11B Vision Radiology Mini
A multimodal model based on the Llama architecture that supports vision and text instructions and is optimized with 4-bit quantization (see the loading sketch below).
Image-to-Text · p4rzvl · 69 downloads · 0 likes

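The 4-bit quantization mentioned above is typically handled through bitsandbytes when a checkpoint is loaded with Transformers. The following is a minimal sketch under that assumption; the repo id is a placeholder rather than the model's confirmed path, and the exact model class may differ from the one used here.

```python
# Minimal sketch: loading a vision-language checkpoint in 4-bit with bitsandbytes.
# The repo id below is a hypothetical placeholder, not a confirmed path for this model.
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

MODEL_ID = "your-org/llama-3.2-11b-vision-radiology-mini"  # hypothetical repo id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bf16 for stability
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",  # requires the `accelerate` package
)
```
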
## Smolvlm2 2.2B Instruct 4bit
SmolVLM2-2.2B-Instruct-4bit is an MLX-format conversion of the SmolVLM2 vision-language model, focused on video-text-to-text tasks.
Apache-2.0 · Image-to-Text · Transformers · English · smdesai · 24 downloads · 1 like

## Kowen Vol 1 Base 7B
A Korean vision-language model based on Qwen2-VL-7B-Instruct, supporting image-to-text tasks.
Apache-2.0 · Image-to-Text · Transformers · Korean · Gwonee · 22 downloads · 1 like

## Med CXRGen I
Med-CXRGen-I is a multimodal large language model fine-tuned from LLaVA-v1.5-7B, specializing in generating the impression section of radiology reports from chest X-ray images.
Apache-2.0 · Image-to-Text · Transformers · X-iZhang · 86 downloads · 1 like

## Med CXRGen F
Med-CXRGen-F is a multimodal large language model fine-tuned from LLaVA-v1.5-7B, designed for radiology report generation, particularly the automatic generation of findings from chest X-ray examinations.
Apache-2.0 · Image-to-Text · Transformers · X-iZhang · 86 downloads · 1 like

## Qwen2 VL 7B SafeRLHF
A multimodal large language model fine-tuned from Qwen2-VL-7B-Instruct on the SafeRLHF dataset, focusing on visual question answering with an emphasis on safety (see the inference sketch below).
Apache-2.0 · Image-to-Text · Safetensors · English · Foreshhh · 1,630 downloads · 2 likes

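Because this checkpoint is described as a Qwen2-VL-7B-Instruct fine-tune, inference presumably follows the standard Qwen2-VL flow in Transformers. The sketch below uses the base model id as a stand-in; the fine-tuned repo id, image path, and question are assumptions, not taken from the model card.

```python
# Minimal Qwen2-VL-style VQA sketch (assumes transformers >= 4.45 with Qwen2-VL support).
# The repo id, image path, and question are placeholders; swap in the fine-tune's actual id.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # base model used as a stand-in here
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a chat prompt that contains one image slot followed by a text question.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Does this image contain anything unsafe?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the model's answer is decoded.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
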
## Xgen Mm Phi3 Mini Instruct Dpo R V1.5
xGen-MM is a series of multimodal foundation models from Salesforce AI Research that builds on the BLIP series and is trained on high-quality image captions and interleaved image-text data.
Apache-2.0 · Image-to-Text · Safetensors · English · Salesforce · 305 downloads · 18 likes

## Chartgemma
ChartGemma is a chart understanding and reasoning model built on PaliGemma. Trained with visual instruction fine-tuning, it processes chart images directly and captures visual trends and underlying information (see the chart-QA sketch below).
MIT · Image-to-Text · Transformers · English · ahmed-masry · 1,243 downloads · 41 likes

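Since ChartGemma is built on PaliGemma, a reasonable assumption is that it loads through the PaliGemma classes in Transformers. The sketch below reflects that assumption; the repo id, chart image, and question are illustrative, and the exact class and prompt format should be checked against the model card.

```python
# Hedged sketch of chart question answering with a PaliGemma-based checkpoint.
# The repo id is assumed from the listing; the image path and question are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "ahmed-masry/chartgemma"  # assumed repo id for the ChartGemma entry
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

chart = Image.open("chart.png").convert("RGB")        # placeholder chart image
question = "Which year has the highest revenue?"      # placeholder question

inputs = processor(text=question, images=chart, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt and image tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
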
## Xgen Mm Phi3 Mini Instruct R V1
xGen-MM is the latest series of large multimodal foundation models from Salesforce AI Research, building on the BLIP series with strong image understanding and text generation capabilities.
Image-to-Text · Transformers · English · Salesforce · 804 downloads · 186 likes

## Llava Med 7b Delta
LLaVA-Med is a biomedical multimodal model built through visual instruction fine-tuning, capable of processing biomedical images and text.
Other · Image-to-Text · Transformers · microsoft · 257 downloads · 67 likes

## OTTER MPT7B Init
OTTER-MPT7B-Init is a set of weights for initializing Otter model training, converted directly from OpenFlamingo.
MIT · Image-to-Text · Transformers · luodian · 53 downloads · 3 likes

## Blip Image Captioning
An image captioning model based on the BLIP architecture that generates concise textual descriptions for input images (see the captioning sketch below).
Image-to-Text · Transformers · nnpy · 17 downloads · 6 likes

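For a BLIP-architecture captioner, the standard Transformers BLIP classes apply. The sketch below uses the public Salesforce BLIP base checkpoint as a stand-in; the listed model is assumed, not confirmed, to expose the same interface.

```python
# Minimal BLIP image-captioning sketch using the Transformers BLIP classes.
# The checkpoint id and image path are stand-ins, not taken from the listing above.
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

model_id = "Salesforce/blip-image-captioning-base"  # stand-in checkpoint with the same architecture
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")  # placeholder image
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
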